Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

delay health checks until transport socket secrets are ready. #13516

Merged
merged 14 commits into from
Oct 16, 2020

Conversation

mpuncel
Copy link
Contributor

@mpuncel mpuncel commented Oct 12, 2020

This fixes a sequencing issue that could result in undesirable Envoy
bootup times. When active health checks are configured on a cluster
using a transport socket that uses SDS to fetch secrets, it was possible
for the initial health check to run before secrets are loaded, which
results in a failure. The health check would then wait for the
no_traffic_interval (default of 60s) before trying again.

This commit makes health checks wait until secrets are loaded before
starting, implemented by adding support for registering a callback on
TransportSocketFactory. SslSocketFactory supports invoking the callback
when secrets are loaded, and all other socket types currently invoke the
callback immediately, preserving existing behavior.

Signed-off-by: Michael Puncel mpuncel@squareup.com

Commit Message:
delay health checks until transport socket secrets are ready.

This fixes a sequencing issue that could result in undesirable Envoy
bootup times. When active health checks are configured on a cluster
using a transport socket that uses SDS to fetch secrets, it was possible
for the initial health check to run before secrets are loaded, which
results in a failure. The health check would then wait for the
no_traffic_interval (default of 60s) before trying again.

This commit makes health checks wait until secrets are loaded before
starting, implemented by adding support for registering a callback on
TransportSocketFactory. SslSocketFactory supports invoking the callback
when secrets are loaded, and all other socket types currently invoke the
callback immediately, preserving existing behavior.

Additional Description:
Risk Level: Medium
Testing: Unit tests
Docs Changes: N/A
Release Notes: Updated
Fixes #12389

This fixes a sequencing issue that could result in undesirable Envoy
bootup times. When active health checks are configured on a cluster
using a transport socket that uses SDS to fetch secrets, it was possible
for the initial health check to run before secrets are loaded, which
results in a failure. The health check would then wait for the
no_traffic_interval (default of 60s) before trying again.

This commit makes health checks wait until secrets are loaded before
starting, implemented by adding support for registering a callback on
TransportSocketFactory. SslSocketFactory supports invoking the callback
when secrets are loaded, and all other socket types currently invoke the
callback immediately, preserving existing behavior.

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
@PiotrSikora
Copy link
Contributor

Drive-by comment, but wouldn't #13344 (keep clusters warming) also solve this?

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
@mpuncel
Copy link
Contributor Author

mpuncel commented Oct 12, 2020

@PiotrSikora interesting, I was unaware of that PR! From a quick look, I don't think that PR actually delays health checks, so it does not address the problem from #12389. If there was a delay fetching a secret, Envoy would still enter the LIVE state in the middle of a no_traffic_interval which could be a minute long, so there could be up to a minute where Envoy is "ready" but has not performed any "real" health checks (meaning TLS secrets were loaded for it).

I do think this PR might coincidentally fix #11120 for some cases, because the cluster is not considered warmed until health checks have run for one interval. There would still be some edge cases, however, for instance if health checks are performed over a raw_buffer socket that does not need to load secrets, they would still run immediately and the cluster might be marked as warmed before all secrets are loaded. So that PR still has value as well.

@mpuncel
Copy link
Contributor Author

mpuncel commented Oct 12, 2020

That PR could maybe be modified to delay loadCluster until secrets have loaded though, which I think would then obviate the need for this one. That might block more of the cluster initialization process than is necessary though, such as DNS resolution or EDS

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
@lizan
Copy link
Member

lizan commented Oct 13, 2020

@PiotrSikora interesting, I was unaware of that PR! From a quick look, I don't think that PR actually delays health checks, so it does not address the problem from #12389. If there was a delay fetching a secret, Envoy would still enter the LIVE state in the middle of a no_traffic_interval which could be a minute long, so there could be up to a minute where Envoy is "ready" but has not performed any "real" health checks (meaning TLS secrets were loaded for it).

Envoy will still enter LIVE state but the cluster won't be active, then no HC will be there, isn't that the case?

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
* master: (22 commits)
  http: using CONNECT_ERROR for HTTP/2 (envoyproxy#13519)
  listener: respect address.pipe.mode (it didn't work) (envoyproxy#13493)
  examples: Fix more deprecations/warnings in configs (envoyproxy#13529)
  overload: tcp connection refusal overload action (envoyproxy#13311)
  tcp: towards pluggable upstreams (envoyproxy#13331)
  conn_pool: fixing comments (envoyproxy#13520)
  Prevent SEGFAULT when disabling listener (envoyproxy#13515)
  Convert overload manager config literals to YAML (envoyproxy#13518)
  Fix runtime feature variable name (envoyproxy#13533)
  dependencies: refactor repository location schema utils, cleanups. (envoyproxy#13452)
  router:  fix an invalid ASSERT when encoding metadata frames in the router. (envoyproxy#13511)
  http2: Proactively disconnect connections flooded when resetting stream (envoyproxy#13482)
  ci use azp to sync filter example (envoyproxy#13501)
  mongo_proxy: support configurable command list for metrics (envoyproxy#13494)
  http local rate limit: note token bucket is shared (envoyproxy#13525)
  wasm/extensions: Wasm extension policy. (envoyproxy#13526)
  http: removing envoy.reloadable_features.http1_flood_protection (envoyproxy#13508)
  build: update ppc64le CI build status shield (envoyproxy#13521)
  dependencies: enforce dependency shepherd sign-off via RepoKitteh. (envoyproxy#13522)
  Add no_traffic_healthy_interval (envoyproxy#13336)
  ...

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
@mpuncel
Copy link
Contributor Author

mpuncel commented Oct 14, 2020

@lizan it looks like that PR doesn't change the flow until onClusterInit() is called, which is after the initial round of health checks has completed, so it doesn't solve my problem.

Even after that PR is merged, I believe this problematic sequence is still possible:

  1. Health checks are performed (and fail because secrets aren't loaded)
  2. (new functionality) cluster manager properly recognizes that the cluster isn't "warmed" and waiting on secrets load
  3. secrets load, Envoy goes LIVE, all clusters marked active.
  4. envoy goes LIVE, requests fail with "no healthy upstream". This period could be almost 60 seconds long because of the default for no_traffic_interval
  5. Requests succeed after the next health check interval

That said, that PR could maybe be modified to block for secrets updating before calling cluster.initialize() or something like that, but that will also delay EDS/DNS waiting on secrets that aren't necessary. Alternatively change the cluster API so it can do like cluster.startHealthChecks() after secrets load

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
lizan
lizan previously approved these changes Oct 15, 2020
Copy link
Member

@lizan lizan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you merge main branch?

* master:
  ci: use multiple stage (envoyproxy#13557)
  tls: update BoringSSL to 2192bbc8 (4240). (envoyproxy#13567)
  fix macos v8 build (envoyproxy#13572)
  Fixed Health Check Fuzz corpus syntax (envoyproxy#13576)
  ci: Remove shellcheck diff (envoyproxy#13560)
  ci: Increate brew retry interval (envoyproxy#13565)
  dependencies: fix some of the fallout from Wasm merge. (envoyproxy#13569)
  hds: add support for delta updates in specifier (envoyproxy#13067)
  ci: workaround for actions/runner-images#1811 (envoyproxy#13577)
  ratelimit: be able to disable x-envoy-ratelimited response header sent (envoyproxy#13270)
  Update opencensus library (envoyproxy#13549)
  ci: use azp for api and go-control-plane sync (envoyproxy#13550)
  docs: Remove/make generic lyft references in docs (envoyproxy#13559)
  check_format: adding 2 more release note checks (envoyproxy#13444)
  [Wasm] Add cluster metadata fallback and upstream host metadata (envoyproxy#13477)
  [fuzz] Added validation for secrets (envoyproxy#13543)
  Add Platform Specific Feature guidance to PR template (envoyproxy#13547)

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
@mpuncel
Copy link
Contributor Author

mpuncel commented Oct 15, 2020

merged!

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
@lizan lizan merged commit 34b67f9 into envoyproxy:master Oct 16, 2020
mpuncel added a commit to mpuncel/envoy that referenced this pull request Oct 17, 2020
* master: (22 commits)
  delay health checks until transport socket secrets are ready. (envoyproxy#13516)
  test, oauth2: Make sure config test runs field validation (envoyproxy#13496)
  [http] swap codec implementations to default new (envoyproxy#13579)
  wasm: update proxy-wasm-cpp-host (envoyproxy#13606)
  postgres: do not copy and linearize received data when it is not going to be used (envoyproxy#13393)
  configs: Update configs v2 -> v3 (envoyproxy#13562)
  http2: Remove RELEASE_ASSERTs in sendPendingFrames() error handling (envoyproxy#13546)
  dependencies: track untracked implied dependencies, wrapup dashboard. (envoyproxy#13571)
  listener: add match all filter chain (envoyproxy#13449)
  fix mistakes in docstrings (envoyproxy#13603)
  ratelimit: add route entry metadata to ratelimit actions (envoyproxy#13269)
  cluster manager: avoid immediate activation for dynamic inserted cluster when initialize (envoyproxy#12783)
  ext_authz: Avoid calling check multiple times (envoyproxy#13288)
  docs: Unexclude remaining configs from validation (envoyproxy#13534)
  build: update rules_rust to allow Rustc in RBE (envoyproxy#13595)
  docs: Update sphinxext.rediraffe (envoyproxy#13589)
  Deprecate moonjit support on Windows before beta (envoyproxy#13541)
  dependencies: bump LuaJIT to 2.1 branch HEAD @ e9af1ab. (envoyproxy#13474)
  docs: add TLS stats to cluster stats doc (envoyproxy#13561)
  ci: stop building alpine-debug images in favor of ubuntu-based debug image (envoyproxy#13598)
  ...

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
@rgs1
Copy link
Member

rgs1 commented Oct 19, 2020

x-posting in case someone else bumps into this issue, we had to revert this internally because it breaks (on a significant number of cases) hot restart for us (gets stuck in PRE_INIT).

@mattklein123
Copy link
Member

@rgs1 @mpuncel can we get this change reverted while we debug?

rgs1 pushed a commit to rgs1/envoy that referenced this pull request Oct 19, 2020
…envoyproxy#13516)"

This reverts commit 34b67f9.

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
@rgs1
Copy link
Member

rgs1 commented Oct 19, 2020

#13639

@mpuncel
Copy link
Contributor Author

mpuncel commented Oct 20, 2020

@mattklein123 fine with me! Where is the best place to discuss what might have gone wrong here and how to proceed? Should I open an issue or another PR after the revert is merged?

@mattklein123
Copy link
Member

@mattklein123 fine with me! Where is the best place to discuss what might have gone wrong here and how to proceed? Should I open an issue or another PR after the revert is merged?

I would reopen a PR, and then try to work with @rgs1 to debug? I don't immediately see how hot restart would change the behavior of this PR so will need some poking.

mattklein123 pushed a commit that referenced this pull request Oct 20, 2020
…#13516)" (#13639)

This reverts commit 34b67f9.

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
mpuncel added a commit to mpuncel/envoy that referenced this pull request Oct 21, 2020
* master: (22 commits)
  ci: various improvements (envoyproxy#13660)
  dns: fix defunct fd bug in apple resolver (envoyproxy#13641)
  build: support ppc64le with wasm (envoyproxy#13657)
  [fuzz] Added random load balancer fuzz (envoyproxy#13400)
  dependencies: compute and check release dates via GitHub API. (envoyproxy#13582)
  mac ci: try ignoring update failure (envoyproxy#13658)
  watchdog: Optimize WatchdogImpl::touch in preparation to more frequent petting of the watchdog. (envoyproxy#13103)
  typos: fix a couple 'enovy' mispellings (envoyproxy#13645)
  lua: Expose stream info downstreamLocalAddress and downstreamDirectRemoteAddress for Lua filter (envoyproxy#13536)
  tap: fix upstream streamed transport socket taps (envoyproxy#13638)
  Revert "delay health checks until transport socket secrets are ready. (envoyproxy#13516)" (envoyproxy#13639)
  Watchdog: use abort action as a default if killing is enabled. (envoyproxy#13523)
  [fuzz] Fixed divide by zero bug (envoyproxy#13545)
  wasm: flip the meaning of the "repository" in envoy_wasm_cc_binary(). (envoyproxy#13621)
  fix: record recovered local address (envoyproxy#13581)
  docs: fix incorrect compressor filter doc (envoyproxy#13611)
  docs: clean up docs for azp migration (envoyproxy#13558)
  wasm: fix building Wasm example. (envoyproxy#13619)
  test: Refactor flood tests into a separate test file (envoyproxy#13556)
  wasm: re-enable tests with precompiled modules. (envoyproxy#13583)
  ...

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
mpuncel added a commit to mpuncel/envoy that referenced this pull request Apr 29, 2021
This commit is a revival of
envoyproxy#13516 which aimed to solve
envoyproxy#12389. That PR was merged and
then reverted due to a deadlock bug, which is still present in this
commit.

The fix will be applied in a subsequent commit in order to make the
resolution clearer to PR reviewers.

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Should HC activation be delayed until needed secrets are available?
6 participants